# Faster Transformer: CUDA-Centric **BERT Inference Optimization**

이미선 **NVIDIA** 







#### CONTENTS

- 1. Background and Motivation
- 2. Performance Analysis/Optimization of BERT Inference on GPU
- 3. Evaluation
- 4. Faster Transformer Repository







# 1. Background





## 1.1 What is BERT?

#### One of the Most Popular Large-Scale Language Models

- Based on Transformer Encoder
- Provides a leap in accuracy for various NLP tasks beyond conversational AI
- Companies across industries are trying to use the model in production



## 1.2 Challenge in Production

- Quality of Service: Accuracy + Latency
- BERT requires significant amounts of computation during inference
- An obstacle for companies deploying BERT in their real-time applications





- BERT-base latency on **CPU**
- # Layers: 12
- Batch Size: 1
- # Heads: 12
- Size per Head: 64

### 1.3 Characteristics of Inference

- Compute capability may differ from that used for training, e.g., training with multiple V100s vs. inference with a single T4
- No backward pass
- Inference-specific optimization is necessary and possible





# 2. Performance Analysis of **BERT Inference on GPU**





# 2.1 Purpose of Analysis

- To check whether there are opportunities for latency reduction and to get hints for performance optimization
- To verify that the applied techniques are really effective
- Profiling tools such as Nsight Systems can be useful





#### 2.2 BERT Encoder Cell



DEVIEW 2019



## 2.2.1 Profiling BERT Encoder Cell

- A single encoder cell launches more than 40 CUDA kernels!







# 2.3. GELU



### 2.3.1 GELU Activation

- Easy-to-write element-wise op in TensorFlow by composing existing ops
- But how about performance?

```
import numpy as np
import tensorflow as tf

def gelu(x):
    # tanh approximation of the Gaussian CDF
    cdf = 0.5 * (1.0 + tf.tanh(
        (np.sqrt(2 / np.pi) * (x + 0.044715 * tf.pow(x, 3)))))
    return x * cdf
```

## 2.3.2 Profiling GELU on GPU

- GELU consists of 8 CUDA kernels
- Their aggregated runtime is almost equal to that of W\*x+b
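One way to read the figure of 8 kernels is to count the element-wise ops in `gelu`, assuming each op runs as its own kernel. The slide does not list the kernels, so this mapping is an assumption. The pure-Python sketch below stands in for the GPU: each call to the hypothetical helper `elementwise_pass` models one kernel that reads its inputs and writes a full intermediate tensor.

```python
import math

passes = 0  # number of element-wise passes, a stand-in for kernel launches

def elementwise_pass(fn, *tensors):
    """Model one kernel: read every input element, write a new intermediate list."""
    global passes
    passes += 1
    return [fn(*vals) for vals in zip(*tensors)]

def gelu_unfused(x):
    # Assumed one-op-per-kernel decomposition of the tanh-based GELU.
    t = elementwise_pass(lambda v: v ** 3, x)                       # 1: pow
    t = elementwise_pass(lambda v: 0.044715 * v, t)                 # 2: mul
    t = elementwise_pass(lambda a, b: a + b, x, t)                  # 3: add
    t = elementwise_pass(lambda v: math.sqrt(2 / math.pi) * v, t)   # 4: mul
    t = elementwise_pass(math.tanh, t)                              # 5: tanh
    t = elementwise_pass(lambda v: 1.0 + v, t)                      # 6: add
    t = elementwise_pass(lambda v: 0.5 * v, t)                      # 7: mul
    return elementwise_pass(lambda a, b: a * b, x, t)               # 8: mul

y = gelu_unfused([-1.0, 0.0, 1.0])
print(passes, y)
```

Under this reading, every intermediate `t` is a round trip through memory, which is the cost the following slides set out to remove.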







### 2.3.3 Memory Access of naïve GELU







#### 2.3.4 Kernel Launch Overhead





- Kernel launch is not perfectly free (5~10 us)

[Figure: timeline of kernel launches; labels: time, Global Memory]
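A rough back-of-the-envelope estimate shows why the launch count matters. The sketch below multiplies the 5~10 us range from the slide by the number of launches; it assumes the launch costs simply add up and are not hidden by asynchronous execution, which may not hold in practice. The function name is hypothetical.

```python
LAUNCH_OVERHEAD_US = (5.0, 10.0)  # per-launch range quoted on the slide

def launch_overhead_us(num_kernels):
    """Return the (low, high) launch overhead in microseconds for num_kernels launches.

    Assumption: overheads are purely additive (no overlap with GPU execution).
    """
    low, high = LAUNCH_OVERHEAD_US
    return (num_kernels * low, num_kernels * high)

unfused = launch_overhead_us(8)  # GELU executed as 8 separate kernels
fused = launch_overhead_us(1)    # GELU executed as a single kernel
print(unfused, fused)
```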



#### 2.3.5 Fused GELU



[Figure: timeline of the fused GELU kernel; axis: time]



### 2.3.6 Fused GELU CUDA C++ Function

- All the operations are done in on-chip registers
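The slide's CUDA C++ function is not reproduced in this text, so the following is only a Python analogue of the idea, not the original code. Each element is loaded once, all intermediates stay in local variables (the counterpart of on-chip registers), and the result is stored once. The names `gelu` and `fused_gelu` are illustrative.

```python
import math

def gelu(v):
    """Tanh approximation of GELU for one scalar, matching the TensorFlow formula."""
    cdf = 0.5 * (1.0 + math.tanh(
        math.sqrt(2.0 / math.pi) * (v + 0.044715 * v * v * v)))
    return v * cdf

def fused_gelu(buf):
    """Single pass over the data, in place: one read and one write per element."""
    for i in range(len(buf)):  # on a GPU, each index would map to one thread
        v = buf[i]             # the only load from "global memory"
        buf[i] = gelu(v)       # intermediates never leave local variables
    return buf

out = fused_gelu([-1.0, 0.0, 1.0])
print(out)
```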







# 2.4. addBias + GELU



### 2.4.1 BERT Encoder Cell revisited

Fully-Connected Layer

- Wx: highly optimized cuBLAS GEMM
- +b: simple addBias CUDA kernel







### 2.4.2 Fusion of addBias and GELU

- Simply call GELU device function inside your addBias kernel!
- Wx still relies on the cuBLAS GEMM
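As a minimal sketch of this fusion, the loop below applies the bias and GELU in one pass over a GEMM result. It assumes the result is a row-major `m x n` matrix stored as a flat list and the bias has length `n`; the layout and the function names are assumptions for illustration, not the repository's actual kernel.

```python
import math

def gelu(v):
    """Tanh approximation of GELU for one scalar."""
    cdf = 0.5 * (1.0 + math.tanh(
        math.sqrt(2.0 / math.pi) * (v + 0.044715 * v * v * v)))
    return v * cdf

def add_bias_gelu(out, bias, m, n):
    """In place: out[i][j] = gelu(out[i][j] + bias[j]) on a flat row-major buffer."""
    for idx in range(m * n):    # one GPU thread per element in a real kernel
        col = idx % n           # the bias is shared by every row
        out[idx] = gelu(out[idx] + bias[col])
    return out

result = add_bias_gelu([0.0, 1.0, -2.0, 0.5], [1.0, -1.0], m=2, n=2)
print(result)
```

Compared with running addBias and GELU back to back, the intermediate `Wx + b` tensor is never written out.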





# 2.5. addBias + LayerNorm





## 2.5.1 BERT Encoder Cell revisited

- Residual connection
- addBias
- Layer Normalization







### 2.5.2 Fused Layer Normalization

Residual connection + addBias + LayerNorm
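The three steps can be sketched for a single row of the hidden dimension as follows. This is a plain-Python illustration under stated assumptions: `gamma`, `beta`, and the `eps` value are the usual LayerNorm parameters but their names and the default `1e-6` are not taken from the slides.

```python
import math

def add_bias_residual_layernorm(out, residual, bias, gamma, beta, eps=1e-6):
    """Return LayerNorm(out + residual + bias) for one row of length n."""
    n = len(out)
    # Residual connection and addBias fused into a single pass.
    x = [out[j] + residual[j] + bias[j] for j in range(n)]
    mean = sum(x) / n                            # a block-level reduction on a GPU
    var = sum((v - mean) ** 2 for v in x) / n    # a second block-level reduction
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(x[j] - mean) * inv_std * gamma[j] + beta[j] for j in range(n)]

row = add_bias_residual_layernorm(
    out=[1.0, 2.0, 3.0, 4.0],
    residual=[0.5, 0.5, 0.5, 0.5],
    bias=[0.0, 0.0, 0.0, 0.0],
    gamma=[1.0] * 4,
    beta=[0.0] * 4,
)
print(row)
```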





# 2.6. Multi-Head Attention





### 2.6.1 Multi-Head Attention

Input: (B x S x N x H), i.e., batch size x sequence length x number of heads x size per head














### 2.6.2 Fusion of addBias and Transpose





- Improves thread-level parallelism, launch overhead, and memory efficiency
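The index arithmetic behind such a fused kernel can be sketched in Python. The example assumes the target layout is (B x N x S x H), the usual head-major layout for attention, and that the bias is indexed by (head, position within head); both are assumptions, since the slide's figure is not available here.

```python
def add_bias_transpose(src, bias, B, S, N, H):
    """src: flat (B, S, N, H) tensor; bias: flat (N, H). Returns a flat (B, N, S, H) tensor."""
    dst = [0.0] * (B * S * N * H)
    for b in range(B):
        for s in range(S):
            for n in range(N):
                for h in range(H):
                    src_idx = ((b * S + s) * N + n) * H + h
                    dst_idx = ((b * N + n) * S + s) * H + h
                    # Bias add and transpose happen in the same read/write.
                    dst[dst_idx] = src[src_idx] + bias[n * H + h]
    return dst

B, S, N, H = 1, 2, 2, 1
src = [0.0, 1.0, 2.0, 3.0]  # order: (s0,n0), (s0,n1), (s1,n0), (s1,n1)
out = add_bias_transpose(src, [10.0, 20.0], B, S, N, H)
print(out)
```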

### 2.6.3 Scale, Mask and Softmax

- Scale and Mask are element-wise operations





#### 2.6.5 Fused Softmax









### 2.6.6 CUDA Thread/Memory Hierarchy

- thread: local memory
- warp (32 threads): CUDA provides useful primitives for warp-level data exchange
- thread block: shared memory
- grid: global memory
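To illustrate what warp-level data exchange buys, the sketch below simulates a butterfly sum reduction across the 32 lanes of one warp. It is a Python model only: each step mimics every lane reading the value held by lane `lane ^ offset`, which is the access pattern of CUDA's `__shfl_xor_sync`. The slides do not say which primitive Faster Transformer uses, so treat the choice as an assumption.

```python
WARP_SIZE = 32

def warp_all_reduce_sum(lane_vals):
    """Simulate a butterfly (XOR) sum reduction across one warp.

    After log2(32) = 5 exchange steps, every lane holds the sum of all 32 inputs,
    without going through shared or global memory.
    """
    assert len(lane_vals) == WARP_SIZE
    vals = list(lane_vals)
    offset = WARP_SIZE // 2
    while offset > 0:
        # All lanes exchange simultaneously, so build the new list from the old one.
        vals = [vals[lane] + vals[lane ^ offset] for lane in range(WARP_SIZE)]
        offset //= 2
    return vals

sums = warp_all_reduce_sum(list(range(32)))
print(sums[0])
```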

### 2.6.7 Softmax Implementation Sketch

[Figure: softmax implementation sketch; axis: Sequence Length]

- sum\_val (in shared memory)
- e<sup>masked\_val</sup> / s\_sum (in local memory)
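Putting scale, mask, and softmax together for one row of attention scores might look like the sketch below. Two details are assumptions not stated on the slides: masked positions are pushed down by adding -10000, a common BERT convention, and the row maximum is subtracted before exponentiation for numerical stability.

```python
import math

def scale_mask_softmax(scores, mask, scale):
    """Softmax over one row of attention scores of length seq_len.

    mask[j] is 1 for a valid position and 0 for a padded one.
    """
    # Scale and mask are element-wise, so they fuse naturally into the softmax pass.
    masked = [s * scale + (1.0 - m) * -10000.0 for s, m in zip(scores, mask)]
    max_val = max(masked)                           # block-wide max reduction on a GPU
    exps = [math.exp(v - max_val) for v in masked]
    sum_val = sum(exps)                             # block-wide sum, shared by all threads
    return [e / sum_val for e in exps]              # per-thread division

# Size per head is 64 in the slides' configuration, hence the 1/sqrt(64) scale.
probs = scale_mask_softmax([8.0, 8.0, 8.0], [1, 1, 0], scale=1.0 / math.sqrt(64))
print(probs)
```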





## 2.6.8 Task-Specific Optimization

Input Tensor Shape: [batch\_size, head\_num, seq\_len, seq\_len]

- Tasks with large batch sizes: already have enough thread blocks
  - # blocks: (batch\_size x head\_num), block size: (seq\_len x seq\_len)
- Tasks with small batch sizes: need to increase # thread blocks
  - # blocks: (batch\_size x head\_num x seq\_len), block size: (seq\_len)
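Plugging numbers into the two configurations makes the trade-off concrete. The sketch below uses batch size 1, 12 heads, and sequence length 32, values that appear in the slides' benchmark settings; the function name is hypothetical.

```python
def softmax_launch_configs(batch_size, head_num, seq_len):
    """Return (num_blocks, elements_per_block) for the two layouts on the slide."""
    # One block per attention matrix: few, large blocks.
    per_matrix = (batch_size * head_num, seq_len * seq_len)
    # One block per matrix row: many, small blocks.
    per_row = (batch_size * head_num * seq_len, seq_len)
    return per_matrix, per_row

per_matrix, per_row = softmax_launch_configs(batch_size=1, head_num=12, seq_len=32)
print(per_matrix, per_row)
```

With only 12 blocks, most of the 80 SMs on a V100 would sit idle, whereas 384 blocks give the scheduler enough work to spread across the whole GPU.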

#### 2.7 Fused Multi-Head Attention

Wx and MatMul are calls to highly optimized cuBLAS GEMMs







## 2.7.1 Profiling FasterTransformer









## 2.8. Lower Precision



#### 2.8.2 Tensor Cores in NVIDIA GPUs

- Matrix-multiply-and-accumulate units available in Volta/Turing architectures
- To improve throughput by using lower precisions, e.g., FP16

[Figure: Tensor Core matrix multiply-and-accumulate on 4x4 matrices, D = A x B + C, with A and B in FP16 and C and D in FP16 or FP32]





#### 2.8.3 Enable FP16 in cuBLAS GEMMs



| Atype/Btype | Ctype      |
|-------------|------------|
| CUDA_R_32F  | CUDA_R_32F |
| CUDA_R_16F  | CUDA_R_16F |
| CUDA_R_16F  | CUDA_R_32F |
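Why it can matter whether results are kept in FP16 or FP32 is easy to show with a toy dot product. The sketch below is not a model of cuBLAS or Tensor Cores: it emulates rounding with Python's `struct` module (format `e` is IEEE 754 half precision, `f` is single precision) and compares an FP16 running sum against an FP32 one. All function names are illustrative.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

def to_fp32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

def dot(a, b, accumulate):
    """Dot product with FP16 inputs; `accumulate` rounds the running sum each step."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = accumulate(acc + to_fp16(x) * to_fp16(y))
    return acc

ones = [1.0] * 4096
fp16_acc = dot(ones, ones, to_fp16)  # running sum kept in FP16
fp32_acc = dot(ones, ones, to_fp32)  # running sum kept in FP32
print(fp16_acc, fp32_acc)
```

In this contrived case the FP16 running sum stalls at 2048, because the spacing between adjacent FP16 values there is 2, while the FP32 sum reaches 4096. Real workloads are rarely this extreme, but the example suggests why checking accuracy after switching precision is worthwhile.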



### 2.8.4 Profiling Again





- Reduced the latency of GEMMs by halving the precision (FP32 -> FP16)





# 3. Evaluation



## 3.1. Methodology

- Baseline: TensorFlow with XLA
- Model: BERT Base
- GPU: V100, P4, and T4
- CPU: Intel Xeon Gold 6132 CPU @ 2.60GHz





#### 3.2. Performance Comparison on V100

Lower is Better

- # Layers: 12
- Sequence Length: 32
- # Heads: 12
- Size per Head: 64
- Memory Clock: 877MHz
- Processor Clock: 1380MHz



#### 3.3. Faster Transformer on T4

Lower is Better



- # Layers: 12
- Batch Size: 1
- # Heads: 12
- Size per Head: 64
- Memory Clock: 5000MHz
- Processor Clock: 1590MHz





# 4. Faster Transformer Repository

#### 4.1. Code Structure

- All that I explained has been open-sourced as Faster Transformer!
  https://github.com/NVIDIA/DeepLearningExamples/tree/master/FasterTransformer
- CUDA C++ implementation, with TensorFlow and TensorRT interfaces
- Samples illustrating how to use it in C++, TensorFlow, and TensorRT



### 4.2. Use Case: PingAn's PA-Occam-Bert

| Rank | Date     | 1-example Latency (milliseconds) | Model                                              | Framework         |
|------|----------|----------------------------------|----------------------------------------------------|-------------------|
| 1    | Jul 2019 | **7.5790**                       | PA-Occam-Bert (Ping An Technology Occam Platform)  | TensorFlow 1.13.0 |
| 2    | Feb 2019 | 7.9000                           | FastFusionNet (Wu et al., Cornell, SayMosaic, Google) | PyTorch v0.3.1 |
| 3    | Oct 2017 | 100.0000                         | BiDAF (Stanford DAWN)                              | TensorFlow v1.2   |

Source: https://dawn.cs.stanford.edu/benchmark/#squad-inference-time

- DAWNBench SQuAD
- F1 score 75.80
- Faster Transformer Integration on 1X Tesla V100
- https://github.com/geekerzli/PA-Occam-Bert

## 4.3. Upcoming Faster Transformer 2.0

- Transformer Decoder will be included!
- Your feedback is highly appreciated: https://github.com/NVIDIA/DeepLearningExamples/issues





## Q&A





# Thank You



